ABSTRACT

GANDALF (General Alpha Numeric Direct Access Library
Facility) is an information retrieval system designed and implemented
at the University of New Mexico for the purpose of retrieving
abstracts from large abstract data bases, such as the ERIC system.
Previous batch-process information retrieval systems for use with the
ERIC data base have been extremely slow, and thus expensive in
computer time. GANDALF uses the user request to produce a list of
addresses within the overall data base, so that only a small subset
of the material is selected, and processing of unreferenced material
is avoided. Furthermore, since GANDALF was designed to be used by
persons with little or no computer experience, an attempt has been
made to make the request statements as simple to use as possible. In
comparison runs, GANDALF was from ten to forty times as fast as
QUERY (the currently available ERIC search system) in real time, and
four to 77 times as fast in computer time. (Author/RH)
PREFACE:

The following is a portion of an in-depth report describing
GANDALF. Those persons desiring more information are invited
to contact the authors or:

Jonathan D. Embry
Director, Computer Services
Southwest Research Associates
212 Bryn Mawr NE
Albuquerque, New Mexico 87106

Our thanks to Professor Don Morrison, Wilt Byruni, Ken Friedenback
and the staff of the U.N.M. Computing Center and the
College of Education.



INTRODUCTION 



GANDALF (General AlphaNumeric Direct Access Library
Facility) is an information retrieval system designed and
implemented at the University of New Mexico (U.N.M.) for the
purpose of retrieving abstracts from large abstract data
bases, such as the ERIC (Education Resource Information
Clearinghouse) system.

The initial interest in working with the retrieval of
selected information from large scale data bases the size
of the ERIC system began after observing the large quantity
of computer time being used by a retrieval program, QUERY,
provided by the U.S. Office of Education in conjunction
with the ERIC data base of approximately 75,000 abstracts
on an IBM 360/67.

Upon examination of QUERY, it was noted that although
the abstracts were stored on direct access devices (four
IBM 2314 disk packs) the search process was sequential
with no use of any type of indexes. Thus, the basic problem
was determined: the data base had outgrown its access
method.
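
The difference between the two access methods can be illustrated with a minimal sketch. It is not QUERY or GANDALF code; the record contents and the names RECORDS, sequential_search, build_index and indexed_search are assumptions made only for illustration. The sketch counts how many records each approach must examine.

```python
# Toy data base: record address -> keywords (hypothetical contents).
RECORDS = {
    0: ["READING", "TESTING"],
    1: ["COMPUTER ASSISTED INSTRUCTION"],
    2: ["READING"],
    3: ["TESTING", "COMPUTER ASSISTED INSTRUCTION"],
}


def sequential_search(keyword):
    """Examine every record, as a search without indexes must."""
    hits, examined = [], 0
    for address, keywords in RECORDS.items():
        examined += 1
        if keyword in keywords:
            hits.append(address)
    return hits, examined


def build_index(records):
    """Map each keyword to the ascending list of addresses containing it."""
    index = {}
    for address in sorted(records):
        for keyword in records[address]:
            index.setdefault(keyword, []).append(address)
    return index


INDEX = build_index(RECORDS)


def indexed_search(keyword):
    """Fetch the address list directly; only the hits need be read."""
    hits = INDEX.get(keyword, [])
    return hits, len(hits)
```

With this toy data, both searches for READING return the same addresses, but the sequential search examines all four records while the indexed search reads only the two that match.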

It was decided that if an access method like GANDALF
were to be developed, it would be far more effective if
it included "modularity" as a principal feature. This
modularity feature would give GANDALF the ability to access
a wide variety of data bases (including any set of machine
readable records, generally in narrative or natural-language
format, such as ERIC files and CHEMISTRY ABSTRACTS) with a
minimal amount of modification. Later sections of this
paper discuss those modules and how they interface, and
describe the current status of implementation and the
future additions being considered to improve and extend
GANDALF.

The reason for having computers process these kinds of
data bases is the need to select a comparatively small
group of records from a very large data base. GANDALF was
designed to be used by people with little or no prior
computer experience, the only user requirements being that
the user have knowledge of his needs and the contents of
the data base being used. The selection criterion, which
is called a REQUEST, is written with one or more KEYWORDS
that the desired elements of the data base contain or
logically relate to. For example, a REQUEST for the author
J. Smith is actually a request for that set of records in
the data base that contain the character string 'J.SMITH'
in the author field. A KEYWORD, then, can be any character
string that could be used to reference a record, such as
COMPUTER ASSISTED INSTRUCTION, ENGLISH (SECOND LANGUAGE),
and so on.

In order to reduce the volume of this paper, frequent
references are made to techniques and processes (such as
reverse Polish notation and recursive programming) with
which the reader may not be familiar. If this situation
arises, the reader is referred to the literature as a
source of background information. The authors are available
for any type of additional assistance which may be
desired.



OVERVIEW OF GANDALF DESIGN

The goals of the GANDALF project were to produce an
information retrieval system with the following characteristics:
1) a user-oriented request language which would
assist users in retrieving information and minimize user
inconvenience and frustration; and 2) a system which would
take advantage of third-generation equipment, especially
direct access techniques, to process requests as efficiently
as possible.

The user orientation was achieved by designing a new
retrieval language that will be described later. The
general philosophy was to eliminate as many artificial
constraints as possible in input format and at the same time
to allow the production of complex terms and expressions.
The second objective was to reduce the large amounts of
computer time spent processing extensive narrative format
data bases such as the ERIC files (see Figure 1). It was
achieved by designing GANDALF to build a number of indexes
that are used in conjunction with a user REQUEST (a
selection criterion written with one or more KEYWORDS which
the desired elements of the data base contain or to which
they logically relate) to produce a list of addresses that
point to records in the main data base that relate to that
REQUEST. These addresses are then used to directly retrieve
only those records satisfying that REQUEST without having to
process unreferenced material in the data base.

Figure 1. GANDALF overall design

The indexes built can be thought of as a series of index
levels (see Figure 2) where an element in one level points to
a related group of elements in the next level. A KEYWORD
at the first level is processed by a version of Don Morrison's
PATRICIA algorithm to yield a unique number, called a
PATRICIA NUMBER. The PATRICIA NUMBER points into an INDIRECT
POINTER TABLE which has two elements for each PATRICIA
NUMBER, one to be used if the KEYWORD is considered a primary
term, that is, if it is expected to be the primary subject,
author, etc., the other to be used if the KEYWORD is a
secondary term. The PATRICIA NUMBER is also used as a pointer
into an occurrences table which has a primary and secondary
entry for each member; this table describes how many times
each KEYWORD occurs. Each element of the INDIRECT POINTER
TABLE points to a list of elements in the DIRECT POINTER
FILE. The items of a particular list point directly to
records in the main data base which contain that KEYWORD.
Then, any complex Boolean relation specified for a set of
KEYWORDS may be evaluated by performing the appropriate
Boolean operations on the lists which are associated with
the different KEYWORDS. The lists are built so that the
elements are in strict ascending sequence, with a special
code indicating the end of a list. The Boolean operations
AND, OR, and BUT NOT essentially consist of merging two
or more of these lists together. The resulting list can be
used to directly access the records that satisfy that
Boolean expression.
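
The merging of ascending lists can be sketched as follows. This is an illustrative reconstruction, not the GANDALF routines themselves: the function names are assumptions, and Python list lengths stand in for the special end-of-list code. Each operation makes a single pass over its two input lists, which is possible only because both are in strict ascending sequence.

```python
def list_and(a, b):
    """Addresses present in both ascending lists."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


def list_or(a, b):
    """Addresses present in either list, without duplicates."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            out.append(a[i])
            i += 1
        else:
            out.append(b[j])
            j += 1
    out.extend(a[i:])  # at most one of these two tails is non-empty
    out.extend(b[j:])
    return out


def list_but_not(a, b):
    """Addresses in the first list that are absent from the second."""
    out, i, j = [], 0, 0
    while i < len(a):
        if j == len(b) or a[i] < b[j]:
            out.append(a[i])
            i += 1
        elif a[i] == b[j]:
            i += 1
            j += 1
        else:
            j += 1
    return out
```

For the lists [1, 3, 5, 8] and [3, 4, 8], AND gives [3, 8], OR gives [1, 3, 4, 5, 8], and BUT NOT gives [1, 5]. Every result is again in ascending sequence, so it can serve as input to a further operation.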

The index building process (see Figure 3) is performed
once initially and then repeated every time the data base
is updated (quarterly for ERIC). The KEYSEP program builds
a key-reference file with one record for each occurrence of
each KEYWORD. Each record contains a KEYWORD as well as a
DIRECT POINTER that points to the abstract that contains the
KEYWORD. After being sorted alphabetically by KEYWORD, this
file is used by PNTBLD. PNTBLD uses the new key-reference file
merged with the key-reference file from the previous update
to build the permanent DIRECT POINTER FILE and a temporary
key-frequency file. The DIRECT POINTER FILE is a direct
access file containing lists of DIRECT POINTERS. The
key-frequency file has one record for each unique KEYWORD.
Each record contains the KEYWORD and, for the primary and
secondary levels, the number of occurrences of that
KEYWORD and INDIRECT POINTERS that point to the corresponding
lists in the DIRECT POINTER FILE. After the key-frequency
file is sorted by frequency, it is used by PATBLD to build
the PATRICIA tables. Since the records going into PATBLD have
unique KEYWORDS, the PATRICIA table can be broken up into
several independent segments. This allows the size of a
segment to be a function of the amount of core available,
since only one segment need be in core at any one time.
Sorting by frequency of occurrence ensures that the first
segment will contain the most frequently occurring KEYWORDS.
The underlying assumption is that most REQUESTS will be for
frequently occurring KEYWORDS and can be satisfied by
looking in the first few segments instead of the entire table.
A further improvement on this idea is to include frequency of
usage as well as occurrence as the sorting criterion. Even
if the above assumption is invalidated completely, results
would probably be no worse than if the table were built
randomly. Thus, to the extent that requests are consistent
with previous requests and the data base, the time required to
look up a request will be reduced accordingly.

Figure 3. Index Building Process
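
The flow through KEYSEP, PNTBLD and PATBLD can be sketched in miniature. The sketch below is a simplification under stated assumptions, not the programs themselves: plain dictionaries stand in for the PATRICIA tables, the primary and secondary levels and the merge with the previous update are omitted, the end-of-list code is assumed to be -1, and the function names and sample data are hypothetical.

```python
END_OF_LIST = -1  # assumed value; the paper says only "a special code"

# Hypothetical abstracts: address -> keywords found in that abstract.
ABSTRACTS = {
    10: ["READING", "TESTING"],
    11: ["READING"],
    12: ["READING", "PHONICS"],
    13: ["TESTING"],
}


def keysep(abstracts):
    """One (keyword, address) pair per occurrence, sorted by keyword."""
    pairs = [(kw, addr) for addr, kws in abstracts.items() for kw in kws]
    return sorted(pairs)


def pntbld(pairs):
    """Build the flat direct pointer file and the key-frequency records."""
    direct_file, key_freq = [], []
    for kw, addr in pairs:
        if not key_freq or key_freq[-1][0] != kw:
            if direct_file:
                direct_file.append(END_OF_LIST)  # close the previous list
            # record layout: keyword, occurrence count, indirect pointer
            key_freq.append([kw, 0, len(direct_file)])
        direct_file.append(addr)
        key_freq[-1][1] += 1
    if direct_file:
        direct_file.append(END_OF_LIST)
    return direct_file, key_freq


def patbld(key_freq, segment_size):
    """Sort by descending frequency and cut the table into segments."""
    ordered = sorted(key_freq, key=lambda r: (-r[1], r[0]))
    return [
        {kw: (count, ptr) for kw, count, ptr in ordered[i:i + segment_size]}
        for i in range(0, len(ordered), segment_size)
    ]


def lookup(segments, direct_file, keyword):
    """Search segments in order; return (addresses, segments examined)."""
    for n, segment in enumerate(segments, start=1):
        if keyword in segment:
            count, ptr = segment[keyword]
            return direct_file[ptr:ptr + count], n
    return [], len(segments)
```

With a segment size of two, the frequent KEYWORD READING is found in the first segment, while the rare KEYWORD PHONICS requires a second segment to be examined, which is the behavior that sorting by frequency is intended to produce.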



The retrieval program (see Figure 4) compiles a
number of requests into several tables. Presently, requests
are submitted in a batch mode on cards, with the future
possibility of using interactive terminals or remote equipment
remaining open.

Each unique KEYWORD occurring in a series of requests
is added to a temporary VOCABULARY created for each run. For
each KEYWORD in the VOCABULARY, an entry is made in a DIRECTORY
indicating position, type and length of the KEYWORD.
Simultaneously, a POLISH STRING is created, specifying the Boolean
operations to be performed. For each KEYWORD in the VOCABULARY,
a PATRICIA NUMBER is found using the previously built PATRICIA
TABLES. The PATRICIA NUMBER is used as an index into the
INDIRECT POINTER TABLE, which yields a pointer to a list in the
DIRECT POINTER FILE, and into the occurrence table, which gives
the length of that list. The Boolean operations specified in
the POLISH STRING are then performed on the lists that
correspond to the original KEYWORDS in the REQUEST. The resultant
list of record addresses is then entered into a queue of records
to be printed. Whenever the main data base can be made
available to the computer, the actual references can be printed (see
Figure 5). The only parts of the system that are involved with
the actual format or storage method of the main data base are
the key separating program (KEYSEP) and the final print
program (GPRINT), so that differently formatted data bases such as
NASA abstracts, CHEM abstracts, and MARC records from the
Library of Congress could all be processed by GANDALF in
their original format by modifying those two relatively simple
pieces of code, thus enhancing the flexibility of the system.
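
The evaluation of a POLISH STRING against the KEYWORD lists can be sketched with a stack. The sketch assumes the string is held in reverse Polish (postfix) order as a list of tokens, uses a hypothetical LISTS table in place of the lookup through the PATRICIA and pointer tables, and uses Python set operations in place of list merging for brevity; none of these details is taken from the GANDALF code.

```python
# Hypothetical address lists, standing in for the DIRECT POINTER FILE.
LISTS = {
    "READING": [10, 11, 12],
    "TESTING": [10, 13],
    "PHONICS": [12],
}

# Each operator combines two ascending lists into a new ascending list.
OPERATORS = {
    "AND": lambda a, b: sorted(set(a) & set(b)),
    "OR": lambda a, b: sorted(set(a) | set(b)),
    "BUT NOT": lambda a, b: sorted(set(a) - set(b)),
}


def evaluate(polish_string):
    """Evaluate a postfix token list: keywords push, operators combine."""
    stack = []
    for token in polish_string:
        if token in OPERATORS:
            right = stack.pop()
            left = stack.pop()
            stack.append(OPERATORS[token](left, right))
        else:
            # An unknown keyword matches no records.
            stack.append(LISTS.get(token, []))
    if len(stack) != 1:
        raise ValueError("malformed POLISH STRING")
    return stack[0]
```

The REQUEST (READING OR TESTING) BUT NOT PHONICS would be held as the tokens READING, TESTING, OR, PHONICS, BUT NOT, and evaluate returns [10, 11, 13] for the sample lists. That result is the list of record addresses that would be queued for printing.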



SOUTHWEST RESEARCH ASSOCIATES 

P.O. Box 4092 
Albuquerque, New Mexico 87106 



TO: Dr. Don Morrison

FROM: Jock Embry

SUBJECT: Timing tests on GANDALF



September :?0, 1972 



Last Saturday I ran three ERIC searches in order to compare the different times
required by QUERY (the current production program) and GANDALF. Following is a
summary of the results. For each search, four times are shown. The first is the
time for the initial run. For QUERY this is the time to execute a simple program
that breaks the request into different jobs for each disk. For GANDALF it is the
actual execution of GANDALF; that is, all processing necessary to produce preliminary
reports and generate disk addresses of requested abstracts. The other three lines for
each search correspond to the time required to process each disk. For QUERY, that
is the time to actually process and complete a search. For GANDALF it is the time to
simply retrieve and print the requested abstracts. Since virtually no work has been
done on tuning and improving the GANDALF print program, those times can probably
be improved considerably. Times are reported as wall clock time in minutes (CPU
time in seconds).



                   QUERY                   GANDALF
             Wall     (CPU)           Wall     (CPU)

Search 1      .28    (   .39)          .99    (  3.57)
  disk 1    15.40    ( 70.38)          .90    ( 12.78)
  disk 2    14.15    ( 55.19)*        1.23    ( 18.32)
  disk 3    13.88    ( 50.62)          .96    ( 10.96)
  Total     43.69    (186.58)         4.08    ( 45.63)

Search 2      .26    (   .40)         1.05    (  5.08)
  disk 1    15.98    (134.86)**       0       (  0   )
  disk 2    16.74    (142.78)**       0       (  0   )
  disk 3    14.73    (117.57)**       0       (  0   )
  Total     47.71    (395.61)         1.05    (  5.08)

Search 3      .28    (   .42)         1.12    ( 12.73)
  disk 1    13.83    (279.95)         0       (  0   )
  disk 2    13.81    (205.06)*        0       (  0   )
  disk 3    10.10    (240.99)          .93    (  4.63)
  Total     38.02    (725.42)         2.05    ( 17.36)

exceeded 100 hits

 *  abended due to disk I/O errors
**  cancelled because of excessive output



From these samples it appears GANDALF is ten to forty times as fast as QUERY in 
wall clock time and four to seventy-seven times as fast in CPU time. Since the 
final results were the same in each case, I think we have fulfilled our goal of 
producing a more efficient retrieval system for the ERIC files. 
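
The ratios can be recomputed from the Total lines of the summary. The short sketch below is an arithmetic check only; the totals are taken as printed, and the names TOTALS and speedups are introduced here for illustration.

```python
# Total lines as printed: (wall clock minutes, CPU seconds).
TOTALS = {
    "Search 1": {"QUERY": (43.69, 186.58), "GANDALF": (4.08, 45.63)},
    "Search 2": {"QUERY": (47.71, 395.61), "GANDALF": (1.05, 5.08)},
    "Search 3": {"QUERY": (38.02, 725.42), "GANDALF": (2.05, 17.36)},
}


def speedups(totals):
    """Return {search: (wall clock ratio, CPU ratio)}, QUERY over GANDALF."""
    result = {}
    for search, times in totals.items():
        q_wall, q_cpu = times["QUERY"]
        g_wall, g_cpu = times["GANDALF"]
        result[search] = (round(q_wall / g_wall, 1), round(q_cpu / g_cpu, 1))
    return result
```

On these totals the wall clock ratios come to about 10.7, 45.4 and 18.5, and the CPU ratios to about 4.1, 77.9 and 41.8, which agrees with the figures quoted above if those are read as rounded.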

cc: Dr. Bell 

Mick McMahan - 
Steve Baca 